Code

Imports

Read the whole dataset and reduce it to what we are interested in
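A minimal sketch of this step, assuming the data is a CSV readable with pandas. The file name is not given here, so the example reads from an in-memory string; the sample rows and `unused_column` are invented, and the column names are taken from the later steps.

```python
import io

import pandas as pd

# In-memory stand-in for the real file (rows invented for illustration).
raw_csv = io.StringIO(
    "dobdb_family_id,earliest_publn_date,appln_title,appln_abstract,unused_column\n"
    "1,2001-07-15,Alkaline cell,An alkaline cell.,x\n"
    "2,2005-03-01,Lithium-ion battery,A lithium-ion battery.,y\n"
)

# Columns we are interested in; everything else is dropped at load time.
wanted = ["dobdb_family_id", "earliest_publn_date", "appln_title", "appln_abstract"]

df = pd.read_csv(raw_csv, usecols=wanted, parse_dates=["earliest_publn_date"])
```

Passing `usecols` avoids loading columns that would be discarded anyway; reading everything and then selecting `df[wanted]` gives the same result.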

Sort it by ['dobdb_family_id', 'earliest_publn_date']
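The sort can be sketched as follows, on a toy frame with invented values. Sorting by family first and publication date second puts each family's rows in chronological order, which the later "keep only the last" step relies on.

```python
import pandas as pd

# Toy stand-in for the patent dataset (values invented for illustration).
df = pd.DataFrame({
    "dobdb_family_id": [2, 1, 2, 1],
    "earliest_publn_date": pd.to_datetime(
        ["2005-03-01", "2001-07-15", "2003-01-10", "1999-11-30"]
    ),
    "appln_title": ["c", "b", "d", "a"],
})

# Sort by family, then by date within each family.
df_sorted = df.sort_values(
    ["dobdb_family_id", "earliest_publn_date"]
).reset_index(drop=True)
```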

Reduce to the years we are interested in
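One possible way to filter by year, assuming `earliest_publn_date` is a datetime column. The bounds `START_YEAR` and `END_YEAR` are hypothetical placeholders, not the values used in this analysis.

```python
import pandas as pd

# Toy data (invented for illustration).
df = pd.DataFrame({
    "earliest_publn_date": pd.to_datetime(
        ["1995-05-01", "2000-01-01", "2007-06-30", "2012-09-09"]
    ),
})

# Hypothetical bounds; both ends are included.
START_YEAR, END_YEAR = 2000, 2010

years = df["earliest_publn_date"].dt.year
df_years = df[years.between(START_YEAR, END_YEAR)]
```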

In appln_abstract and appln_title: Replace NaNs with ' '
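A short sketch of the replacement with `DataFrame.fillna`, on invented sample rows:

```python
import numpy as np
import pandas as pd

# Toy data with one missing title and one missing abstract.
df = pd.DataFrame({
    "appln_abstract": ["A cell.", np.nan],
    "appln_title": [np.nan, "Battery"],
})

# Replace NaNs with a single space in the two text columns only.
text_cols = ["appln_abstract", "appln_title"]
df[text_cols] = df[text_cols].fillna(" ")
```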

Just a comment

I got confused because "alkaline" comes up in the terms with decreasing frequency.

This is probably because there are also secondary alkaline batteries.

https://www.sciencedirect.com/topics/materials-science/secondary-alkaline-battery (2009):

"In the future, secondary alkaline batteries will come under increasing competitive pressure from lithium-ion systems in an expanding number of applications."

Infer our time frame from the data
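A minimal sketch, assuming the time frame is simply the span between the earliest and latest publication year in the data. The variable names are illustrative.

```python
import pandas as pd

# Toy data (invented for illustration).
df = pd.DataFrame({
    "earliest_publn_date": pd.to_datetime(
        ["2003-04-01", "2001-02-02", "2005-12-31"]
    ),
})

years = df["earliest_publn_date"].dt.year
first_year, last_year = int(years.min()), int(years.max())

# Every year in the frame, including years without any rows.
all_years = list(range(first_year, last_year + 1))
```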

For every family, keep only the last English, non-NaN title and abstract
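One way this could be done is sketched below. It assumes language columns named `appln_title_lg` and `appln_abstract_lg` with the code `"en"` for English; those names, like the helper columns `title_en` and `abstract_en`, are assumptions. Non-English entries are masked to NaN, and `GroupBy.last` then returns the last non-null entry per column.

```python
import numpy as np
import pandas as pd

# Toy data (invented); the language columns are an assumption.
df = pd.DataFrame({
    "dobdb_family_id": [1, 1, 1, 2],
    "earliest_publn_date": pd.to_datetime(
        ["2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01"]
    ),
    "appln_title": ["Old title", "New title", "Neuer Titel", "Zinc cell"],
    "appln_title_lg": ["en", "en", "de", "en"],
    "appln_abstract": ["Old abstract", np.nan, "Zusammenfassung", "A zinc cell."],
    "appln_abstract_lg": ["en", np.nan, "de", "en"],
})

df = df.sort_values(["dobdb_family_id", "earliest_publn_date"])

# Mask everything that is not English so that it counts as missing.
df["title_en"] = df["appln_title"].where(df["appln_title_lg"] == "en")
df["abstract_en"] = df["appln_abstract"].where(df["appln_abstract_lg"] == "en")

# last() skips nulls, so each family gets its last English, non-NaN entry.
per_family = df.groupby("dobdb_family_id")[["title_en", "abstract_en"]].last()
```

Note that title and abstract are picked independently, so they may come from different rows of the same family.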

Get title and abstract counts for each year

Write the counts to a DataFrame and normalise them
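The counting and normalisation might look like the sketch below. The normalisation used here, dividing each year's count by the column total, is an assumption; the column names `year`, `title_en`, `abstract_en`, `n_titles` and `n_abstracts` are illustrative.

```python
import pandas as pd

# Toy per-family table (invented); None marks a missing title or abstract.
per_family = pd.DataFrame({
    "year": [2000, 2000, 2001, 2002],
    "title_en": ["a", "b", None, "c"],
    "abstract_en": ["x", None, None, "y"],
})

# count() ignores missing values, so this yields available texts per year.
counts = per_family.groupby("year")[["title_en", "abstract_en"]].count()
counts.columns = ["n_titles", "n_abstracts"]

# Assumed normalisation: each year's share of the column total.
normalised = counts / counts.sum()
```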

Define stopwords, contexts, equivalents, words to replace, and punctuation
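As a rough illustration of how such definitions could be used together, the sketch below cleans a single text. The word lists are invented examples, not the ones used in this analysis, and the helper `clean_tokens` is hypothetical. Contexts are left out of the sketch.

```python
import string

# Invented example lists, for illustration only.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with"}
EQUIVALENTS = {"batteries": "battery", "cells": "cell"}
PUNCTUATION = string.punctuation


def clean_tokens(text):
    """Lowercase, strip punctuation, map equivalents, drop stopwords."""
    # Turn every punctuation character into a space before splitting.
    table = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))
    tokens = text.lower().translate(table).split()
    tokens = [EQUIVALENTS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = clean_tokens("Electrodes for lithium-ion batteries, and cells.")
```

Replacing punctuation with spaces splits hyphenated terms such as "lithium-ion" into two tokens; whether that is desirable depends on the terms being tracked.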

Define a function for key phrase extraction and counting
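A minimal sketch of n-gram counting, assuming key phrases are contiguous word n-grams (unigrams, bigrams, trigrams, as in the results). The function name `count_ngrams` and its parameters are hypothetical.

```python
from collections import Counter


def count_ngrams(texts, n, stopwords=frozenset()):
    """Count contiguous word n-grams over a collection of texts."""
    counts = Counter()
    for text in texts:
        # Simple whitespace tokenisation; stopwords are dropped first.
        tokens = [tok for tok in text.lower().split() if tok not in stopwords]
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


titles = ["Lithium ion battery", "Lithium ion cell"]
unigrams = count_ngrams(titles, 1)
bigrams = count_ngrams(titles, 2)
trigrams = count_ngrams(titles, 3)
```

Because stopwords are removed before the n-grams are formed, words that were separated by a stopword in the original text end up adjacent.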

Two more definitions

Results

Titles

Titles - unigrams

Titles - bigrams

Titles - trigrams

Abstracts

Abstracts - unigrams

Abstracts - bigrams

Abstracts - trigrams